replicated in different regions of the brain and across modalities. The main points of the
theory are:

1 The Grandmother Neuron Theory

A classical theme in the neurophysiological literature, at least since the
work of Hubel and Wiesel (1962), is the idea of information processing in
the brain as leading to “grandmother” neurons responding selectively to
the precise combination of visual features that are associated with one’s
grandmother. The “grandmother” neuron theory is of course not restricted
to vision and applies as well to other sensory modalities and even to motor
control, in the form of cells corresponding to elemental movements.
Why is this idea so attractive? Because of its simplicity: it replaces
complex information processing with the superficially simpler task of
accessing a memory. The problem of recognition and motor control would be
solved by simply accessing look-up tables containing appropriate
descriptions of objects and of motor actions. The human brain can probably
exploit a vast amount of memory with its $10^{14}$ or so synapses, making
attractive any scheme that replaces computation with memory. In the case of
vision the apparent simplicity of this solution hides the difficult
problems of an appropriate representation of an object and of how to
extract it from complex images. But even assuming that these problems of
representation, feature extraction and segmentation could be solved by
other mechanisms, a fundamental difficulty seems to be intrinsic to the
“grandmother” cell idea. The difficulty consists of the combinatorial
explosion in the number of cells that any scheme of the look-up table type
would reasonably require for either vision or motor control. In the case of
3D object recognition, for instance, there should be for each object as
many entries in the look-up table as there are 2D views of the object, in
principle an infinite number.

The difficulty of a combinatorial explosion lies at the heart of theories
of intelligence that attempt to replace information processing with look-up
tables of precomputed results. In this paper we suggest a scheme that
avoids the combinatorial problem, while retaining the attractive features of
the look-up table. The basic idea is to use only a few entries and interpolate
or approximate among them. A mathematical theory based on this idea
leads to a powerful scheme of learning from examples that is equivalent
to a parallel network of simple processing elements. The scheme has an
intriguingly simple implementation in terms of plausible biophysical
mechanisms. We will discuss in particular the case of 3D object recognition
but will propose that the scheme is possibly used by the brain for several
different information processing tasks. Many information processing
problems can be represented as the composition of one or more multivariate
functions that map an input signal into an output signal in a smooth way.
These modules could be synthesized from a sufficient set of input-output
pairs - the examples - by the scheme described here. Because of the power
and general applicability of this mechanism, we speculate that a part of
the machinery of the brain - including perhaps some of the cortical
circuitry, which is somewhat similar across the different modalities - may be
dedicated to the task of function approximation.


2 How to Synthesize through Learning the
Basic Approximation Module: Regularization Networks

This section describes a technique for synthesizing the approximation
modules discussed above through learning from examples. I first explain how
to rephrase the problem of learning from examples as a problem of
approximating a multivariate function. The material in this section is from
Poggio and Girosi (1989, 1990a, 1990b), where more details can be found.

To illustrate the connection, let us draw an analogy between learning an
input-output mapping and a standard approximation problem, 2D surface
reconstruction from sparse data points. Learning simply means collecting
the examples, i.e., the input coordinates $x_i, y_i$ and the corresponding
output values at those locations, the heights of the surface $d_i$.
Generalization means estimating $d$ at locations $x, y$ where there are no
examples, i.e., no data. This requires interpolating or, more generally,
approximating the surface (i.e., the function) between the data points
(interpolation is the limit of approximation when there is no noise in the
data). In this sense, learning is a problem of hypersurface reconstruction
(Poggio et al., 1988, 1989; Omohundro, 1987).

From this point of view, learning a smooth mapping from examples is
clearly ill-posed, in the sense that the information in the data is not
sufficient to reconstruct uniquely the mapping at places where data are not
available. In addition, the data are usually noisy. A priori assumptions
about the mapping are needed to make the problem well-posed. One of
the simplest assumptions is that the mapping is smooth: small changes in
the inputs cause a small change in the output. Techniques that exploit
smoothness constraints in order to transform an ill-posed problem into a
well-posed one are well known under the term of regularization theory, and
have interesting Bayesian applications (Tikhonov and Arsenin, 1977;
Poggio, Torre and Koch, 1985; Bertero, Poggio and Torre, 1988). We have
recently shown that the solution to the approximation problem given by
regularization theory can be expressed in terms of a class of multilayer
networks that we call regularization networks or Hyper Basis Functions (see
Fig. 1). Our main result (Poggio and Girosi, 1989) is that the
regularization approach is equivalent to an expansion of the solution in
terms of a certain class of functions:


$$f(\mathbf{x}) = \sum_{i=1}^{N} c_i\, G(\mathbf{x}; \boldsymbol{\xi}_i) + p(\mathbf{x}) \tag{1}$$

where $G(\mathbf{x})$ is one such function and the coefficients $c_i$ satisfy a linear
system of equations that depend on the $N$ “examples”, i.e., the data to
be approximated. The term $p(\mathbf{x})$ is a polynomial that depends on the
smoothness assumptions. In many cases it is convenient to include up
to the constant and linear terms. Under relatively broad assumptions,
the Green’s function $G$ is radial and therefore the approximating function
becomes:


$$f(\mathbf{x}) = \sum_{i=1}^{N} c_i\, G(\|\mathbf{x} - \boldsymbol{\xi}_i\|^2) + p(\mathbf{x}), \tag{2}$$

which is a sum of radial functions, each with its center $\boldsymbol{\xi}_i$ on a distinct
data point, and of constant and linear terms (from the polynomial, when
restricted to be of degree one). The number of radial functions, and
corresponding centers, is the same as the number of examples.

The interpretation of Eq. 2 is simple. Consider, for instance, the 2D case,
in which the examples correspond to points of the x,y plane where the
height of the surface is known, and generalization corresponds to estimating
the height of the surface at a point in the plane where data are not
available. The surface is then approximated by the superposition of, say,
several two-dimensional Gaussian distributions, each centered on one of the
data points.
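The 2D reading of Eq. 2 can be made concrete with a minimal sketch. It assumes a Gaussian $G$, omits the polynomial term $p(\mathbf{x})$, and uses made-up data; the names `fit_rbf`, `evaluate_rbf`, `sigma` and `lam` are illustrative choices, not taken from the text.

```python
import numpy as np

def gaussian(r2, sigma=1.0):
    """Gaussian radial function of the squared distance r2."""
    return np.exp(-r2 / sigma**2)

def fit_rbf(centers, heights, sigma=1.0, lam=0.0):
    """Solve (G + lam*I) c = d for the coefficients c_i of Eq. 2 (p(x) omitted)."""
    diff = centers[:, None, :] - centers[None, :, :]
    G = gaussian(np.sum(diff**2, axis=-1), sigma)
    return np.linalg.solve(G + lam * np.eye(len(centers)), heights)

def evaluate_rbf(x, centers, coeffs, sigma=1.0):
    """Evaluate f(x) = sum_i c_i G(||x - xi_i||^2) at the query points x."""
    diff = x[:, None, :] - centers[None, :, :]
    return gaussian(np.sum(diff**2, axis=-1), sigma) @ coeffs

# Five example points in the x,y plane and their (made-up) surface heights d_i.
xi = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
d = np.array([0.0, 1.0, 1.0, 2.0, 1.2])

c = fit_rbf(xi, d)                                       # one coefficient per example
f_at_data = evaluate_rbf(xi, xi, c)                      # interpolates the examples
f_new = evaluate_rbf(np.array([[0.25, 0.75]]), xi, c)    # estimate where no data exist
```

With `lam = 0` the surface passes through every example (interpolation); a positive `lam` would trade exact fit for smoothness, which is the usual way noisy data are handled in this kind of scheme.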

Our derivation shows that the type of basis functions depends on the
specific a priori assumption of smoothness. Depending on it we obtain the
Gaussian $G(r) = e^{-(r/c)^2}$, the well-known “thin plate spline”
$G(r) = r^2 \ln r$, and other specific functions, radial and not. As observed
by Broomhead and Lowe (1988) in the radial case, a superposition of functions
like Eq. 1 is equivalent to a network of the type shown in Fig. 1b.

The network associated with Eq. 2 can be made more general in terms
of the following extension

$$f^*(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha\, G(\|\mathbf{x} - \mathbf{t}_\alpha\|_W^2) + p(\mathbf{x}) \tag{3}$$

where the parameters $\mathbf{t}_\alpha$, which we call “centers,” and the coefficients
$c_\alpha$ are unknown, and are in general much fewer than the data points
($n \le N$). The norm is a weighted norm

$$\|\mathbf{x} - \mathbf{t}_\alpha\|_W^2 = (\mathbf{x} - \mathbf{t}_\alpha)^T W^T W (\mathbf{x} - \mathbf{t}_\alpha) \tag{4}$$

where  W  is  an  unknown  square  matrix  and  the  superscript  T  indicates  the 
transpose.  In  the  simple  case  of  diagonal  W  the  diagonal  elements  assign 
a  specific  weight  to  each  input  coordinate,  determining  in  fact  the  units  of 
measure  and  the  importance  of  each  feature  (the  matrix  W  is  especially 
important  in  cases  in  which  the  input  features  are  of  a  different  type  and 
their  relative  importance  is  unknown).  Equation  3  can  be  implemented  by 
the  network  of  Fig.  1.  A  sigmoid  function  at  the  output  may  sometimes 
be  useful  without  increasing  the  complexity  of  the  system  (see  Poggio  and 
Girosi,  1989).  Notice  that  there  could  be  more  than  one  set  of  Green’s 
functions,  for  instance  a  set  of  multiquadrics  and  a  set  of  Gaussians,  each 
with its own W. Two or more sets of Gaussians, each with a diagonal W,
are equivalent to sets of Gaussians with their own σs.
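As an illustration of Eqs. 3 and 4, the following sketch evaluates a small HyperBF expansion with a non-diagonal $W$. It assumes a Gaussian basis function and omits $p(\mathbf{x})$; all numerical values and the function names are hypothetical.

```python
import numpy as np

def weighted_sq_norm(x, t, W):
    """||x - t||_W^2 = (x - t)^T W^T W (x - t), as in Eq. 4."""
    v = W @ (x - t)
    return float(v @ v)

def hyperbf(x, centers, coeffs, W):
    """Eq. 3 with Gaussian G and p(x) omitted: sum_a c_a exp(-||x - t_a||_W^2)."""
    return sum(c * np.exp(-weighted_sq_norm(x, t, W))
               for c, t in zip(coeffs, centers))

centers = np.array([[0.0, 0.0], [1.0, 2.0]])   # n = 2 centers t_a (hypothetical)
coeffs = np.array([1.0, -0.5])                  # coefficients c_a (hypothetical)
W = np.array([[2.0, 0.0],
              [0.5, 1.0]])                      # a full (non-diagonal) weight matrix
x = np.array([0.3, 1.0])

value = hyperbf(x, centers, coeffs, W)
```

Because only the product $W^T W$ enters the norm, $W$ rescales and mixes the input coordinates before the radial function is applied; a diagonal $W$ reduces to one scale factor per coordinate.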


2.1  Learning 

In the framework of the previous section the stage of learning is simply
the stage of estimating from the data - the examples - the values of the
parameters in the representation we have derived, i.e., equations (3) and
(4). Iterative methods can be used to find the optimal values of the various
sets of parameters, the $c_\alpha$, the elements $w_i$ of $W$, and the
$\mathbf{t}_\alpha$, that minimize an error functional on the set of examples.
Steepest descent is the standard approach; it requires calculation of
derivatives. An even simpler method that does not require calculation of
derivatives (suggested and found surprisingly efficient in preliminary work
by Caprile and Girosi, personal communication) is to look for random changes
(controlled in appropriate ways) in the parameter values that reduce the
error. We define the error functional - also called energy - as

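$$H[f^*] = H_{c,\mathbf{t},W} = \sum_{i=1}^{N} (\Delta_i)^2,$$

with

$$\Delta_i = y_i - f^*(\mathbf{x}_i) = y_i - \sum_{\alpha=1}^{n} c_\alpha\, G(\|\mathbf{x}_i - \mathbf{t}_\alpha\|_W^2).$$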


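In the first method the values of $c_\alpha$, $\mathbf{t}_\alpha$ and $W$ that minimize $H[f^*]$
are regarded as the coordinates of the stable fixed point of the following
dynamical system:

$$\dot{c}_\alpha = -\omega \frac{\partial H[f^*]}{\partial c_\alpha}, \qquad \alpha = 1, \dots, n$$

$$\dot{\mathbf{t}}_\alpha = -\omega \frac{\partial H[f^*]}{\partial \mathbf{t}_\alpha}, \qquad \alpha = 1, \dots, n$$

$$\dot{W} = -\omega \frac{\partial H[f^*]}{\partial W}$$

where $\omega$ is a parameter. The derivatives are rather complex (see Poggio
and Girosi, 1990a and Notes section).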

The second method is simpler: random changes in the parameters are
made and accepted if $H[f^*]$ decreases. Occasionally, changes that increase
$H[f^*]$ may also be accepted (similarly to the Metropolis algorithm).
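The second method can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the procedure used by Caprile and Girosi: inputs are scalar, $W$ reduces to a single weight `w`, $G$ is Gaussian, $p(\mathbf{x})$ is omitted, proposals are Gaussian perturbations of fixed size `scale`, and only changes that decrease $H[f^*]$ are accepted (the occasional Metropolis-like acceptance of increases is left out).

```python
import numpy as np

def model(x, c, t, w):
    """f*(x) = sum_a c_a exp(-(w (x - t_a))^2) for scalar inputs x."""
    return np.exp(-(w * (x[:, None] - t[None, :])) ** 2) @ c

def energy(x, y, c, t, w):
    """H[f*] = sum_i (y_i - f*(x_i))^2."""
    return float(np.sum((y - model(x, c, t, w)) ** 2))

def random_descent(x, y, n_centers=4, steps=3000, scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    c = np.zeros(n_centers)                           # coefficients c_a
    t = np.linspace(x.min(), x.max(), n_centers)      # centers t_a
    w = 1.0                                           # single norm weight
    history = [energy(x, y, c, t, w)]
    for _ in range(steps):
        # Propose a small random change of all parameters at once.
        c_new = c + scale * rng.normal(size=n_centers)
        t_new = t + scale * rng.normal(size=n_centers)
        w_new = w + scale * rng.normal()
        h_new = energy(x, y, c_new, t_new, w_new)
        if h_new < history[-1]:                       # keep only changes that reduce H
            c, t, w = c_new, t_new, w_new
            history.append(h_new)
    return c, t, w, history

# Examples: 40 samples of a smooth 1D mapping (a made-up target).
x = np.linspace(0.0, 2.0 * np.pi, 40)
y = np.sin(x)
c, t, w, history = random_descent(x, y)
```

No derivative of $H[f^*]$ is ever computed; the price is that many proposals are rejected, so the number of energy evaluations is much larger than the number of accepted steps recorded in `history`.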


2.2  Interpretation  of  the  Network 

The interpretation of the network of Fig. 1 is the following. After
learning, the centers of the basis functions are similar to prototypes, since
they are points in the multidimensional input space. Each unit computes a
(weighted) distance of the inputs from its center, that is, a measure of their
similarity, and applies to it the radial function. In the case of the Gaussian,
a unit will have maximum activity when the new input exactly matches
its center. The output of the network is the linear superposition of the
activities of all the basis functions in the network, plus direct, weighted
connections from the inputs (the linear terms of $p(\mathbf{x})$) and from a
constant input (the constant term). Notice that in the limit case of the basis
functions approximating delta functions, the system becomes equivalent to
a look-up table. During learning the weights $c$ are found by minimizing
a measure of the error between the network’s prediction and each of the
examples. At the same time, the centers of the radial functions and the
weights in the norm are also updated during learning. Moving the centers
is equivalent to modifying the corresponding prototypes and corresponds
to task-dependent clustering. Finding the optimal weights $W$ for the norm
is equivalent to transforming appropriately, for instance scaling, the input
coordinates and corresponds to task-dependent dimensionality reduction.

Regularization networks - of which HyperBFs are the most general and
powerful version - represent a general framework for learning smooth
mappings that rigorously connects approximation theory, generalized splines
and regularization with feedforward multilayer networks. They also
contain as special cases the Radial Basis Functions technique (Micchelli, 1986;
Powell, 1987; Broomhead and Lowe, 1988) and several well-known
algorithms, especially in the pattern recognition literature.

3 A Proposal for a Biological Implementation

In this section we point out some remarkable properties of Gaussian
HyperBFs that may have implications for neurobiology.


3.1  Factorizable  Radial  Basis  Functions 

The synthesis of (weighted) radial basis functions in high dimensions may
be easier if they are factorizable. It is easily seen that the only radial
basis function which is factorizable is the Gaussian (with diagonal $W$). A
multidimensional Gaussian function can be represented as the product of
lower dimensional Gaussians. For instance a 2D Gaussian radial function
centered in $\mathbf{t}$ can be written as:

$$G(\|\mathbf{x} - \mathbf{t}\|_W^2) = e^{-\|\mathbf{x} - \mathbf{t}\|_W^2} = e^{-(x - t_x)^2/\sigma_x^2}\; e^{-(y - t_y)^2/\sigma_y^2} \tag{5}$$

with $\sigma_x = 1/w_1$ and $\sigma_y = 1/w_2$, where $w_1$ and $w_2$ are the elements of the
matrix $W$ assumed, in this section, to be diagonal.
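The factorization can be checked numerically. The sketch below assumes $G(r^2) = e^{-r^2}$ and a diagonal $W$, extends the 2D case to three input coordinates, and uses hypothetical values throughout.

```python
import numpy as np

w = np.array([2.0, 0.5, 1.5])      # diagonal elements of W (hypothetical values)
t = np.array([0.2, -1.0, 0.4])     # center t
x = np.array([0.7, 0.4, 0.0])      # input x

# Left-hand side: one multidimensional Gaussian of the weighted norm (Eq. 4).
v = np.diag(w) @ (x - t)
full = np.exp(-(v @ v))

# Right-hand side: product of one-dimensional Gaussians, sigma_i = 1 / w_i.
sigma = 1.0 / w
factors = np.exp(-((x - t) / sigma) ** 2)
product = np.prod(factors)
```

The two numbers agree because the weighted squared norm is a sum over coordinates when $W$ is diagonal, and the exponential turns that sum into a product.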

This dimensionality factorization is especially attractive from the
physiological point of view, since it is difficult to imagine how neurons could
compute $G(\|\mathbf{x} - \mathbf{t}_\alpha\|^2)$. The scheme of figure 2, on the other hand, is
physiologically plausible. Gaussian radial functions in one, two and
possibly three dimensions can be implemented as receptive fields by weighted
connections from the sensor arrays (or some retinotopic array of units
representing with their activity the position of features). Gaussians in higher
dimensions can then be synthesized as products of one- and two-dimensional
receptive fields.

This scheme has three additional interesting features:

1. the multidimensional radial functions are synthesized directly by
appropriately weighted connections from the sensor arrays, without any
need of an explicit computation of the norm and the exponential.

2. 2D Gaussians operating on the sensor array or on a retinotopic array
of features extracted by some preprocessing transduce the implicit
position of features in the array into a number (the activity of the
unit).

3. 2D Gaussians acting on a retinotopic map can be regarded each as
representing one 2D “feature,” i.e., a component of the input vector,
while each center represents the “template,” resulting from the
conjunction of those lower-dimensional features. Notice that in this
analogy the radial basis function is the AND of several features and
could also include the negation of certain features, that is the AND
NOT of them. W weights the importance of the different features.

3.2  Biophysical  Mechanisms 

3.2.1  The  Network 

The multiplication operation required by the previous interpretation of
Gaussian GRBFs to perform the “conjunction” of Gaussian receptive fields
is not too implausible from a biophysical point of view. It could be
performed by several biophysical mechanisms (see Koch and Poggio, 1987).
Here we mention three mechanisms:

1. inhibition of the silent type and related circuitry (see Torre and
Poggio, 1978; Poggio and Torre, 1978)

2. the AND-like mechanism of NMDA receptors

3. a logarithmic transformation, followed by summation, followed by
exponentiation. The logarithmic and exponential characteristic could
be implemented in appropriate ranges by the sigmoid-like
pre-to-postsynaptic voltage transduction of many synapses.
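The third mechanism rests on the identity $\prod_i a_i = \exp(\sum_i \ln a_i)$ for positive $a_i$. A minimal numerical check, with made-up activity values standing in for the outputs of three Gaussian receptive fields:

```python
import math

# Hypothetical positive activities of three receptive fields.
activities = [0.9, 0.5, 0.25]

product_direct = math.prod(activities)

# Logarithm, then summation, then exponentiation.
product_via_logs = math.exp(sum(math.log(a) for a in activities))
```

The identity only holds for strictly positive activities, which is consistent with the remark in the text that the logarithmic and exponential characteristics would be available only in appropriate ranges.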

If the first or the second mechanism is used, the product of figure 3
can be performed directly on the dendritic tree of the neuron representing
the corresponding radial function (alternatively, each dendritic tree may
perform pairwise products only, in which case a logarithmic number of
cells would be required). The scheme also requires a certain amount of
memory per basis unit, in order to store the center vector. In the case of
Gaussian receptive fields used to synthesize Gaussian radial basis functions,
the center vector is effectively stored in the position of the 2D (or 1D)
receptive fields and in their connections to the product unit(s). This is
plausible physiologically.

The linear terms (the direct connections from the inputs to the output
in figure 1) can be realized directly as inputs to the output neuron, which
sums its synaptic inputs linearly (an output nonlinearity is allowed
and will not change the basic form of the model, see Poggio and Girosi,
1989). They may also be realized through intermediate linear units.

3.2.2  Mechanisms  for  Learning 

Do the update schemes have a physiologically plausible implementation?
Consider first the steepest descent methods, which require derivatives.
Equation (6) or a somewhat similar, quasi-Hebbian scheme is not too
unlikely and may require only a small amount of neural circuitry. Equation
(7) seems more difficult to implement for a network of real neurons.

Methods such as the random descent method, which do not require
calculation of derivatives, are biologically much more plausible and seem
to perform very well in preliminary experiments. In the Gaussian case,
with basis functions synthesized through the product of Gaussian receptive
fields, moving the centers means establishing or erasing connections to the
product unit. A similar argument can be made also about the learning of
the matrix W. Notice that in the diagonal Gaussian case the parameters
to be changed are exactly the σ of the Gaussians, i.e., the spread of the
associated receptive fields. Notice also that the σ for all centers on one
particular dimension is the same, suggesting that the learning of σ may
involve the modification of the scale factor in the input arrays rather than
a change in the dendritic spread of the postsynaptic neurons.

In  all  these  schemes  the  real  problem  consists  in  how  to  provide  the 
“teacher”  input  (but  see  figure  5). 

4 Visual Recognition of 3D Objects and
Face Sensitive Neurons

We have recently suggested and demonstrated how to use a HyperBF
network to learn to recognize a 3D object. This section reviews very briefly
this work (Poggio and Edelman, 1990), where more references can be found,
and then suggests that the brain may use a similar strategy. Face sensitive
neurons are discussed as a specific instance.

4.1 HyperBF Networks for Recognizing 3D Objects

A 3D object gives rise to an infinite variety of 2D images or views, because
of the infinite number of possible poses relative to the viewer, and because
of arbitrarily different illumination conditions. Is it possible to synthesize
a module that can recognize an object from any viewpoint, after it learns
its 3D structure from a small set of perspective views? We have
recently shown (Poggio and Edelman, 1990) that the HyperBF scheme
may provide a solution to the problem, provided that relatively stable and
uniquely identifiable features (that we will call “labeled” features) can be
extracted from the image.

In our scheme a view is represented as a 2N vector
$x_1, y_1, x_2, y_2, \dots, x_N, y_N$ of the coordinates on the image plane of N
labeled and visible feature points on the object. We assume that a view of
an object is a vector of this type (instead of position in the image of
feature points we have also used angles between corners and length of
segments or both), in general augmented by components that represent other
properties of the object not necessarily related to its geometric shape,
such as color or texture. We also assume that the function that maps the
views into {0, 1} (0 if the view is of another object, 1 if the view is of
the correct object) can be approximated by a smooth function (if this were
false, one could approximate the mapping from the view to a “standard” view
and then apply a radial function to the result, see Poggio and Edelman,
1990).

The network used for this task is shown in Figure 3 (see also Figure
4). In the simplest version (fixed centers) the centers correspond to some
of the examples, i.e., some views of the object. Updating the centers is
equivalent to modifying the corresponding “prototypical views”. Updating
the weights of the matrix W corresponds to changing the relative
importance of the various features that define the views of an object. This is
important in the case in which these features are of a completely different
type: a large w indicates a larger weight of the feature in the measure of
similarity and is equivalent to a small σ in the Gaussian function.
Features with a small role have a very large σ: their exact position or value
does not matter much. Of course, the problem the network solves is a
caricature of the full problem of object recognition: one isolated object,
without occlusions or noise and moreover with image features assumed to
be matched to model features (for a similar approach see [2]). Existing
computer vision algorithms for model-based recognition typically deal with
more complex situations. We think however that the approach described
here can be extended to more realistic tasks. As a first step in this direction
we have successfully extended the algorithm to deal with noisy, real, mildly
self-occluding objects (Brunelli and Poggio, in press).
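The fixed-center version of the recognition module can be illustrated with a toy sketch. It is not the implementation of Poggio and Edelman (1990): it assumes orthographic rather than perspective projection, rotation about a single vertical axis, a random six-point "object", Gaussian units with a common hypothetical width `sigma`, and training on positive examples only; all function names are illustrative.

```python
import numpy as np

def view_vector(points3d, angle):
    """Rotate the object about the vertical axis and project orthographically.

    Returns the 2N vector (x_1, y_1, ..., x_N, y_N) of image coordinates.
    """
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return (points3d @ R.T)[:, :2].ravel()

def gaussian_units(views, centers, sigma):
    """Activity of each basis unit: Gaussian of the distance from view to center."""
    d2 = np.sum((views[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(0)
target = rng.normal(size=(6, 3))                      # N = 6 labeled feature points
train_angles = np.deg2rad(np.arange(0, 360, 30))      # 12 training views
centers = np.stack([view_vector(target, a) for a in train_angles])

sigma = 1.0
G = gaussian_units(centers, centers, sigma)
coeffs = np.linalg.solve(G, np.ones(len(centers)))    # desired output 1 on each stored view

def recognize(view):
    """Network output for one view: weighted sum of the Gaussian unit activities."""
    return float(gaussian_units(view[None, :], centers, sigma) @ coeffs)

out_train = recognize(centers[3])                             # a stored view
out_novel = recognize(view_vector(target, np.deg2rad(45)))    # between two stored views
```

In this sketch each center is literally a stored prototypical view, and the output for a novel view is obtained by interpolating among the stored ones; a threshold on the output would then turn it into a yes/no decision.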

An interesting conclusion of this work is the small number
of views that is required to recognize an object from the infinite number
of possible views. The results clearly show that the scheme avoids the
main problem of look-up table schemes, the explosion in the number of
entries. Furthermore, the performance of the HyperBF recognition scheme
resembles human performance in a related task. As discussed in Poggio
and Edelman (1990), the number of training views necessary to achieve
an acceptable recognition rate on novel views, 80-100 for the full viewing
sphere, is broadly compatible with the finding that people have trouble
recognizing a novel wire-frame object previously seen from one viewpoint
if it is rotated away from that viewpoint by about 30° (it takes 72 patches
of 30° × 30° to cover the viewing sphere).
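One way to arrive at the figure of 72, assuming the viewing sphere is tiled on a longitude-latitude grid (an assumption; the text does not say how the patches are counted), is

$$\frac{360^\circ}{30^\circ} \times \frac{180^\circ}{30^\circ} = 12 \times 6 = 72 .$$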

Recently, Bülthoff and Edelman (1990; see also references therein) have
obtained interesting psychophysical results that support this model for human
recognition of a certain class of 3D objects against other possible models.
In general, the experimental results fit closely the prediction of theories of
the 2D interpolation variety and appear to contradict theories that involve
3D models.

4.2  Face  Sensitive  Neurons 

The HyperBF recognition scheme we have outlined has suggestive
similarities with some of the data about visual neurons responding to faces
obtained by Perrett and coworkers recording from the temporal association
cortex (see Perrett et al., 1987 and references therein, Poggio and
Edelman, 1990). Let us consider the network of figure 3 as the skeleton for
a model of the circuitry involved in the recognition of faces. One expects
different modules, one for each different object, each of the type of the
network of Figure 3. One also expects hierarchical organizations: for
instance a network of the HyperBF type may be used to recognize certain types
of eyes and then may serve as input to another network involved in
recognizing a certain class of faces, which may be itself one of the inputs to
a network for a specific face. Different types of cells may then be expected.
The overall output of a network for a specific face may be identified with
the behavioral responses associated with recognition and may or may not
coincide with an individual neuron. There should be cells or parts of cells
corresponding to the centers, i.e., to the prototypes used by the networks.
The response of these neurons should be a Gaussian function of the distance
of the input to the template. These units would be somewhat similar to
“grandmother” filters with a graded response, rather than binary detectors,
each representing a prototype. They would be synthesized as the conjunction
of, for instance, two-dimensional Gaussian receptive fields looking at a
retinotopic map of features. During learning, the weights of the various
prototypes in the network output are modified to find the optimal values
that minimize the overall error. The prototypes themselves are slowly
changed to find optimal prototypes for the task. The weights of the different
input features are also modified to perform task-dependent dimensionality
reduction.

Some of these expectations are consistent with the experimental
findings of Perrett et al. (1987). Some of the neurons described have several
of the properties expected from the units of a HyperBF network with a
center, i.e., a prototype that corresponds to a view of a specific face.

Some of the Main Data (from Perrett et al., 1987 and references therein)

•  The  majority  of  cells  responsive  to  faces  are  sensitive  to  the  general 
characteristics  of  the  face  and  they  are  somewhat  invariant  to  its 
exact  position  and  attitude. 

•  Presenting  parts  of  the  face  in  isolation  revealed  that  some  of  the 
cells  responded  to  different  subsets  of  features:  some  cells  are  more 
sensitive  to  parts  of  the  face  such  as  eyes  or  mouth. 

•  There  are  cells  selective  for  a  particular  view  of  the  head.  Some 
cells  were  maximally  sensitive  to  the  front  view  of  a  face,  and  their 
response  fell  off  as  the  head  was  rotated  into  the  profile  view,  and 
others  were  sensitive  to  the  profile  view  with  no  response  to  the  front 
view  of  the  face. 

•  There are cells that are specific to the views of one individual. It
seems that for each known person there would be a set of “face
recognition units”. Our model applies most directly to these neurons.

5  Theories  of  the  Cerebellum  and  of  Motor 
Control 

5.1 Marr’s and Albus’ Models of the Cerebellum

The cerebellum is a part of the brain that is important in the coordination
of complex muscle movements. The neural organization of the cerebellum
is highly regular and well known (see Figure 5). Marr (1969) and Albus
(1972) modeled the cerebellum as a look-up table. The critical part of
their theories is the assumption that the synapses between the parallel
fibers and the Purkinje cells are modified as a function of the Purkinje cell
activity and the climbing fiber input. I suggest (see figure 5) that the
cerebellum is a HyperBF network or set of networks (one for each Purkinje
cell). Instead of a simple look-up table, the cerebellum would be a function
approximation module (in a sense, “an approximating look-up table”). In
our conjecture, basket and Golgi cells would have different roles from the
roles assumed in the Marr-Albus theory. In particular, the Golgi cells,
which receive inputs from the parallel fibers and whose axons synapse on
the granule cell-mossy fiber clusters, may be used to change the norm
weights W.

Key  Assumptions 

•  granule cells correspond to basis units (there may be as many as
200,000 granule cells per Purkinje cell) representing as many “examples”

•  Purkinje  cells  are  the  outputs  of  the  network 

•  climbing  fibers  are  responsible  for  modifying  synapses  from  granule 
cells  to  the  Purkinje  cell. 

5.2  Theories  of  Motor  Control 

There are at least two aspects of motor control in which HyperBF modules
could be used:

•  to compute smooth, time-dependent trajectories - for instance arm
trajectories - given sparse points such as initial, final and intermediate
positions.

•  to  associate  to  each  position  in  the  trajectory  the  appropriate  field 
of  muscle  forces. 

These  two  problems  may  be  solved  by  two  modules  that  can  be  used  in 
series,  the  first  one  providing  the  input  to  the  second  one  (see  figure  6a 
and  6b).  I  will  first  consider  the  problem  of  computing  appropriate  smooth 
trajectories  from  sparse  points  in  space-time.  An  interesting  question  is: 
are  HyperBFs  a  plausible  implementation  for  Flash  and  Hogan’s  minimum 
jerk principle for the coordination of arm movements? Flash and Hogan
(1985) found experimental evidence that arm trajectories minimize jerk,
i.e., $C = \|x^{(3)}\|^2$, where $x^{(3)}$ is the third temporal derivative of $x$.

This suggests a regularization principle with a stabilizer corresponding to
additive quintic splines. HyperBF could implement it using basis units
recruited for the specific motion (as many as there are constrained points)
with Gaussian-like or spline-like time-dependent activities (boundary conditions
may have to be taken into account). The weights would be learned
during training. As Morasso and Mussa-Ivaldi (1982) implied, approximation
schemes of this type amount to composition of elemental movements.
It is interesting to observe that jerk is automatically minimized
by the linear superposition of the appropriate elemental movements, i.e.,
the appropriate Green’s functions. Thus a scheme of the Morasso and
Mussa-Ivaldi type can be made to be perfectly equivalent to the Flash-Hogan
minimization  principle.  The  fact  that  the  minimum  jerk  principle  can  be 
implemented  directly  by  a  HyperBF  network  is  attractive  from  the  point 
of  view  of  a  biological  implementation  since  biologically  implausible  direct 
minimization  procedures  are  not  required  anymore.  The  minimization  is 
implicit  in  the  form  of  the  elemental  movements;  weighted  superposition 
of  the  elemental  movements  seems  a  much  easier  operation  to  implement 
in  the  motor  system  than  explicit  minimization. 
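
As a minimal illustration of what a minimum-jerk trajectory looks like, consider the simplest case of a single movement between two points that starts and ends at rest (zero velocity and zero acceleration at both ends; these boundary conditions are an assumption of this sketch). The solution is then the well-known quintic polynomial in normalized time. The function names below are hypothetical.

```python
def minimum_jerk_position(x0, xf, duration, t):
    """Position at time t of a rest-to-rest minimum-jerk movement from x0 to xf."""
    tau = min(max(t / duration, 0.0), 1.0)  # normalized time, clamped to [0, 1]
    shape = 10 * tau ** 3 - 15 * tau ** 4 + 6 * tau ** 5  # quintic time course
    return x0 + (xf - x0) * shape

def sample_trajectory(x0, xf, duration, steps):
    """Sample the trajectory at steps + 1 equally spaced times."""
    return [minimum_jerk_position(x0, xf, duration, k * duration / steps)
            for k in range(steps + 1)]

# A 2-second movement from 0 to 10 (arbitrary units), sampled at 9 points.
trajectory = sample_trajectory(0.0, 10.0, 2.0, 8)
```

The quintic time course is fixed; only the end points scale it. This is the sense in which the minimization can be implicit in the form of the elemental movement rather than carried out explicitly.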

The second problem requires a neural circuit that associates an equilibrium
position with an appropriate activation. Bizzi (see for instance Bizzi
et al., 1990) suggests that a group of spinal cord interneurons specifies the
limb’s final position and configuration through a field of muscle forces that
has the appropriate equilibrium point. Bizzi et al. (1990) propose that
the spinal cord contains aspects of motor behavior reminiscent of a look-up
table. Their findings extend several results in the area of oculomotor
research, where investigators have described neural structures whose activation
brings the eyes or the head to a unique position. I suggest that the
required look-up table behavior may be implemented through a HyperBF
module that requires the storage of only a few equilibrium positions (or
correspondingly a few conservative-like fields, i.e., appropriate activation
coefficients for the motoneurons) and can interpolate between them (see
figure 6). Notice that the synthesis of a conservative field of muscle forces
could be achieved through the superposition (with arbitrary weights, over
the index $\alpha$) by the motor system of appropriate elementary motor fields
of the form (see Mussa-Ivaldi and Giszter, 1990, in preparation):

\[ \phi(x, \alpha) = \frac{x - x_\alpha}{r_\alpha} \, G'(r_\alpha) \]

with $r_\alpha = \|x - x_\alpha\|$, where $G$ is a radial basis function such as the Gaussian.
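
The superposition of elementary fields can be sketched numerically. The sketch assumes that each elementary field is the gradient of a radial Gaussian G centered at a stored position, so that it points toward that position and vanishes there; the Gaussian width, the sign convention and all function names are assumptions for illustration, not the formulation of Mussa-Ivaldi and Giszter.

```python
import math

def gaussian_prime(r, sigma=1.0):
    """Derivative G'(r) of the Gaussian G(r) = exp(-r^2 / (2 sigma^2))."""
    return -(r / sigma ** 2) * math.exp(-r ** 2 / (2.0 * sigma ** 2))

def elementary_field(x, center, sigma=1.0):
    """Gradient of G(||x - center||): G'(r) * (x - center) / r."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    r = math.sqrt(sum(d * d for d in diff))
    if r == 0.0:
        return [0.0] * len(x)  # the field vanishes at its own center
    g = gaussian_prime(r, sigma)
    return [g * d / r for d in diff]

def combined_field(x, centers, weights, sigma=1.0):
    """Weighted superposition of elementary fields, one per stored center."""
    total = [0.0] * len(x)
    for w, c in zip(weights, centers):
        field = elementary_field(x, c, sigma)
        total = [t + w * f for t, f in zip(total, field)]
    return total

# Two stored positions with equal weights: by symmetry the forces cancel
# at the midpoint between them.
force_at_midpoint = combined_field([0.0, 0.0], [[-1.0, 0.0], [1.0, 0.0]], [1.0, 1.0])
```

Because every elementary field is a gradient, any weighted sum of them is also a gradient, so the combined field stays conservative whatever the weights.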


6  Summary:  a  Proposal  for  How  the  Brain 
Works 

The  theory  proposed  in  this  paper  consists  of  three  main  points: 

1.  it  assumes  that  the  brain  may  use  modules  that  approximate  multi¬ 
variate  functions  and  that  can  be  synthesized  from  sparse  examples 
as  basic  components  for  several  information  processing  tasks. 

2.  it  proposes  that  these  modules  are  realized  in  terms  of  HyperBF 
networks,  of  which  a  rigorous  theory  is  now  available. 

3.  it  shows  how  HyperBF  networks  can  be  implemented  in  terms  of 
plausible  biophysical  mechanisms. 

The theory is in a sense a modern version of the grandmother neurons
idea, made computationally plausible by eliminating the combinatorial
explosion in the number of required cells that was the main problem in the
old idea.

The proposal that much information processing in the brain is performed
through modules that are similar to enhanced look-up tables is
attractive for many reasons. It also promises to bring closer apparently
orthogonal views, such as the immediate perception of Gibson and the
representational theory of Marr, since almost iconic “snapshots” of the world
may allow the synthesis of computational mechanisms completely equivalent
to vision algorithms such as, say, structure-from-motion. The idea
seems to change significantly the computational perspective on several
vision tasks. As a simple example, consider the different specific tasks of
hyperacuity, as considered by the psychophysicists. The theory developed
here would suggest that an appropriate module for the task, somewhat
similar to a new “routine,” may be synthesized by learning in the brain.

Notice  that  the  theory  makes  two  independent  claims:  the  first  is  that 
the  brain  can  be  explained  in  part  in  terms  of  approximation  modules, 
the  second  is  that  these  modules  are  of  the  HyperBF  type.  The  second 
claim  implies  that  the  modules  are  an  extension  of  look-up  tables.  Notice 
that  there  are  schemes  other  than  HyperBF  that  could  be  used  to  extend 
look-up tables. Notice also that multilayer Perceptrons, typically used in
conjunction with back-propagation, can also be considered as approximation
schemes, albeit still without a convincing mathematical foundation.
They cannot, however, be interpreted directly as extensions
of look-up tables (they are more similar to an extension of multidimensional
Fourier series).

The  theory  suggests  that  population  coding  (broadly  tuned  neurons 
combined  linearly)  is  a  consequence  of  extending  a  look-up  table  scheme  - 
corresponding  to  interval  coding  -  to  yield  interpolation  (or  more  precisely 
approximation,  since  the  examples  may  be  noisy),  that  is  generalization. 

The  theory  suggests  some  possibly  interesting  ideas  about  the  evolution 
of  intelligence.  It  also  makes  a  number  of  predictions  for  physiology  and 
psychophysics.  More  work  is  needed  to  specify  sufficiently  the  details  and 
some  of  the  basic  assumptions  of  the  theory  in  order  to  make  it  useful  to 
biologists.  The  next  subsections  deal  with  these  last  three  points. 

6.1 Evolution of Intelligence: From Memory to Computation

There is a duality between computation and memory. Given infinite
resources the two points of view are equivalent: for instance, I could play
chess by precomputing winning moves for every possible state of the
chessboard! More to the point, notice that basic logical operations can be
defined  in  terms  of  truth  tables  and  that  all  boolean  predicates  can  be 
represented  in  disjunctive  normal  form,  i.e.,  as  a  look-up  table. 
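
The equivalence between a boolean predicate and a look-up table can be made concrete with a short sketch. It enumerates the truth table of a predicate, keeps the rows where the output is true, and evaluates the predicate again purely by table look-up; each kept row is one conjunction of the disjunctive normal form. The function names are hypothetical.

```python
from itertools import product

def truth_table(predicate, n_inputs):
    """Enumerate every input combination together with the predicate's value."""
    return {bits: bool(predicate(*bits))
            for bits in product((False, True), repeat=n_inputs)}

def dnf_terms(table):
    """Rows where the predicate is true; each row is one conjunction of the DNF."""
    return [bits for bits, value in table.items() if value]

def evaluate_dnf(terms, bits):
    """Evaluate by pure look-up: true if the input matches any stored row."""
    return any(tuple(bits) == term for term in terms)

def format_dnf(terms, names):
    """Write the stored rows as an explicit OR of ANDs."""
    clauses = []
    for term in terms:
        literals = [name if bit else "NOT " + name for name, bit in zip(names, term)]
        clauses.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(clauses) if clauses else "FALSE"

xor_terms = dnf_terms(truth_table(lambda a, b: a != b, 2))
xor_formula = format_dnf(xor_terms, ["a", "b"])
```

The number of stored rows can grow exponentially with the number of inputs, which is the memory cost paid for avoiding computation.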

Given that the brain probably has a prodigious amount of memory
and given that one can build powerful approximating look-up tables using
techniques such as HyperBF, is it possible that part of intelligence may
be built from a set of souped-up look-up tables? One advantage of this
point of view is that it perhaps makes it easier to understand how intelligence
may have evolved from simple associative reflexes. In more than one sense
(biophysical and computational), HyperBF-like networks are a natural and
rather straightforward development of very simple systems of a few neurons
showing basic learning phenomena such as classical conditioning.

6.2  Predictions  and  Remarks 

General  Predictions 

• Computation, as generalization from examples, emerges from the
superposition of receptive fields in a multidimensional input space.

• Computation is performed by Gaussian receptive fields and their
combination (through some approximation to multiplication), rather
than by threshold functions.

• The theory predicts the existence of low-dimensional feature-like cells
and multidimensional Gaussian-like receptive fields, somewhat similar
to template-like cells. One would expect to find cells that are
tuned to low-level templates, like edges or corners, and others that
are tuned to higher-level templates such as eyes and faces. In all
cases, the prediction is that the activity of the cell should be graded
and should depend in a Gaussian-like way on the distance of the input
from the optimal template along any of the defining dimensions, a
fact that could be tested experimentally on cortical cells.

•  The  HyperBF  scheme  is  a  general-purpose  circuit,  used  in  the  brain 
to  synthesize  modules  that  can  be  regarded  as  approximating  look-up 
tables.  If  this  point  of  view  is  correct,  we  expect  the  same  basic  kind 
of  neural  machinery  to  be  replicated  in  different  parts  of  the  brain 
across  different  modalities  (in  particular  in  different  cortical  areas). 

• The “programming style” used by the brain in solving specific perceptual
and motor problems is to synthesize appropriate architectures
from modules of the type shown in figure 1 (a very simple architecture
built from the basic module of figure 1 is shown in figure 4).

Face  Neurons 

1. Some of the face cells correspond to basis functions with centers in a
high-dimensional input space and are somewhat similar to prototypes
or coarse “grandmother cells.”

2. They could be synthesized as the conjunctions of features with Gaussian-like
distance from the prototype.

3. Face cells are not detectors; often several may be active simultaneously.
The output of the network is a combination of several prototypes.

4. From our preliminary experiments (Poggio and Edelman, 1990) the
number of basis cells that are required per object is about 40-80 for
the full viewing sphere, but far fewer (10-20) for each aspect (for
instance frontal views). I conjecture that a similar estimate holds for
faces.

5. Inputs to the face cells are features such as eye positions, mouth
position, hair color and so on.

6. Eye-feature cells may themselves be the output of HyperBF networks
specialized for eyes.

Cerebellum 

1.  The  cerebellum  is  a  set  of  approximation  modules  for  learning  to 
perform  motor  skills  (both  movements  and  posture). 

2. Its neurons are elements of a HyperBF network: the mossy fibers
are the inputs, the granule cells correspond to the basis functions
$G(x, x_i)$, the Purkinje cells correspond to the output units that summate
the weighted activities of the basis units, whereas the climbing
fibers carry the “teacher” signal.

3. The strength of the modifiable synapses between the parallel fibers
and the Purkinje cells corresponds to the $c_\alpha$.

4. Golgi cells may be involved in modifying during learning the center
positions $t_\alpha$ and the norm-weights $W$.

Motor  Control 

1.  The  qualitative  expectation  is  to  find  cells  and  circuits  corresponding 
to  the  two  stages  shown  in  figure  6.  Spinal  cord  neurons,  according 
to  very  recent  data  by  Bizzi  et  al.  (1990),  specify  the  limb’s  final 
position  and  configuration. 

6.3  The  Future 

The  proposal  of  this  paper  is  just  a  rough  sketch  of  a  theory.  Many  details 
-  some  of  them  critical  -  need  to  be  filled  in.  Some  basic  questions  remain. 
For  instance,  how  reasonable  is  the  idea  of  supervised  learning  schemes? 
Or,  to  say  it  in  a  different  and  perhaps  more  constructive  way,  what  are 
the  systems  that  can  be  synthesized  from  building  blocks  that  are  just 
function  approximation  modules?  And  what  types  of  tasks  can  be  solved 
by  systems  of  that  type? 

On  the  biological  side  of  the  theory,  the  obvious  next  task  is  to  develop 
detailed  proposals  for  the  circuitries  underlying  face  recognition,  and  motor 
control  (including  the  circuitry  of  the  cerebellum)  that  take  into  account 
up-to-date physiological and anatomical data.

NOTES


Section  1 


• Segmentation of an image into parts that are likely to correspond to separate
objects is probably the most difficult problem in vision. Remember that
already in the Perceptrons book (Minsky and Papert, 1969) recognition-in-context
was shown to be significantly harder than recognition of isolated
patterns. We assume here that this problem has been “solved,” at least
to a reasonable extent.

• The same basic machinery in the brain may be used for synthesizing many
different, “small” learning modules, as components of many different systems.
This is very different from suggesting a single giant network that
learns everything.


Section  2 

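The relevant derivatives for optimization methods that need them are,
with $H[f^*] = \sum_i \Delta_i^2$ the error functional and $\Delta_i$ the error on the $i$-th example,

• for the $c_\alpha$:

\[ \frac{\partial H[f^*]}{\partial c_\alpha} = -2 \sum_{i=1}^{N} \Delta_i \, G(\|x_i - t_\alpha\|_W^2) \]

• and for $W$:

\[ \frac{\partial H[f^*]}{\partial W} = -4 W \sum_{\alpha=1}^{n} c_\alpha \sum_{i=1}^{N} \Delta_i \, G'(\|x_i - t_\alpha\|_W^2) \, Q_{i,\alpha} \qquad (8) \]

where $Q_{i,\alpha} = (x_i - t_\alpha)(x_i - t_\alpha)^T$ is a dyadic product and $G'$ is the first
derivative of $G$ (for details see Poggio and Girosi, 1990a).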
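
Derivatives of this kind are easy to check numerically. The sketch below does so for a one-dimensional input with a scalar norm-weight w, under explicitly stated assumptions: the error functional is taken to be the sum of squared errors, with error = y_i - f(x_i); the network is f(x) = sum over centers of c * G(w^2 (x - t)^2); and G(r) = exp(-r). The data values and function names are hypothetical. The analytic derivative with respect to w has the same structure as equation 8, with the dyadic product reduced to the scalar (x_i - t)^2.

```python
import math

def G(r):
    """Basis function of the squared weighted distance r (assumed Gaussian)."""
    return math.exp(-r)

def G_prime(r):
    """First derivative of G with respect to its argument."""
    return -math.exp(-r)

def residuals(xs, ys, c, t, w):
    """Errors y_i - f(x_i) of the network on every example."""
    return [y - sum(ca * G((w * (x - ta)) ** 2) for ca, ta in zip(c, t))
            for x, y in zip(xs, ys)]

def H(xs, ys, c, t, w):
    """Sum of squared errors."""
    return sum(d * d for d in residuals(xs, ys, c, t, w))

def grad_c(xs, ys, c, t, w):
    """Analytic derivative of H with respect to each coefficient."""
    deltas = residuals(xs, ys, c, t, w)
    return [-2.0 * sum(d * G((w * (x - ta)) ** 2) for d, x in zip(deltas, xs))
            for ta in t]

def grad_w(xs, ys, c, t, w):
    """Analytic derivative of H with respect to the scalar norm-weight w."""
    deltas = residuals(xs, ys, c, t, w)
    total = 0.0
    for ca, ta in zip(c, t):
        for d, x in zip(deltas, xs):
            q = (x - ta) ** 2  # scalar counterpart of the dyadic product
            total += ca * d * G_prime(w * w * q) * q
    return -4.0 * w * total

def finite_difference(f, value, eps=1e-6):
    """Central-difference estimate of df/dvalue."""
    return (f(value + eps) - f(value - eps)) / (2.0 * eps)

# Arbitrary toy data: four examples, two centers.
xs, ys = [-1.0, -0.3, 0.4, 1.2], [0.2, 0.9, 0.7, -0.1]
c, t, w = [0.5, -0.8], [-0.5, 0.6], 1.3

numeric_w = finite_difference(lambda v: H(xs, ys, c, t, v), w)
numeric_c0 = finite_difference(lambda v: H(xs, ys, [v, c[1]], t, w), c[0])
```

Comparing `grad_w(xs, ys, c, t, w)` with `numeric_w`, and `grad_c(xs, ys, c, t, w)[0]` with `numeric_c0`, gives a quick consistency check before such derivatives are handed to an optimization method.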

Section  3 

• There are many non-radial functions derived from our regularization formulation,
such as tensor product splines, that are factorizable.

• I have assumed here that all centers have the same W. It is possible
to have sets of different Green’s functions, each set with its own W (see
Poggio and Girosi, 1990a).

• It is natural to imagine hierarchical architectures based on the HyperBF
scheme: a multidimensional Gaussian “template” unit may be a “feature”
input for another radial function (again because of the factorization property
of the Gaussian). Of course, a whole HyperBF network may be one
of the inputs to another HyperBF network.

• I conjecture that equation 8 could be approximated by a Hebbian-like rule
for the elements of the diagonal $W$ such as

\[ w_k(t+1) = w_k(t) - \sum_{\alpha=1}^{n} c_\alpha \gamma \, (x_k(t) - (t_\alpha)_k) \, y_k(t), \qquad (9) \]

where $y$ is the output of the upper layer of figure 1a, i.e., $y = Wx$, and $\gamma$
is

\[ \gamma = \Delta_i \, G'(\|x_i - t_\alpha\|_W^2), \qquad (10) \]

and $i$ labels the $i$-th example. Such a Hebbian rule requires back-connections
from later stages in the network to the upper layer - where $W$ is updated
- in order to broadcast quantities such as the error of the overall network
relative to the $i$-th example and the derivative $G'$ of the activation of the
units.

•  The  mechanisms  and  especially  the  connections  needed  to  implement  the 
learning  equations  or  some  equivalent  scheme  are  an  open  question,  in 
terms  of  biological  plausibility.  More  work  is  needed. 

Section  4 

• The HyperBF scheme addresses only one part of the problem of shape-based
object recognition, the variability of object appearance due to changing
viewpoint. The key issue of how to detect and identify image features
that are stable for different illuminations and viewpoints is outside the
scope of the network.

• Notice that the HyperBF approach to recognition does not require as
inputs the x, y coordinates of image features: other parameters of appropriate
features can also be used.

• In a similar vein, notice that the HyperBF network can provide, with the
same centers (but different c), other parameters of the object, such as its
pose, instead of simply a yes/no recognition signal.

• Recognition of noisy and partially occluded objects, using realistic feature
identification schemes, requires an extension of the scheme. A natural
extension is based on the use of multiple lower-dimensional
centers, corresponding to different subsets of detected features, instead
of one 2N-dimensional center for each view in the example set. This
corresponds to a set of networks capable of recognizing different parts of
an object. It is equivalent to a set of networks each with a diagonal W
with some zero entries in the diagonal, instead of one network with W
with nonzero diagonal elements.

•  Not  all  features  may  be  always  labeled  correctly.  In  general,  one  expects 
a  significant  “correspondence”  problem.  Possibly  the  easiest  solution  is 
to  generate  all  reasonable  sequences  of  labels  for  a  given  input  vector  and 
simply  try  them  out  on  the  network.  This  is  of  course  equivalent  to  trying 
in  parallel  the  given  input  on  many  networks  each  with  a  different  labeling 
of  its  inputs. 

• An obvious use of these learning/approximation modules based on the
HyperBF technique is a hierarchical composition of GRBF
modules, in which the outputs of lower-level modules assigned to detect
object parts and their relative disposition in space are combined to allow
recognition of complex structured objects. Figure 4 is an example of this
architecture.


Section  5 

• Zipser and Andersen (1988) have presented intriguing simulations suggesting
that a backpropagation network trained to solve the problem of
converting visual stimuli in retinal coordinates to head-centered coordinates
generates receptive fields similar to the ones experimentally found
in cortical area 7 of the monkey. We conjecture that Andersen’s data may
be better accounted for by a HyperBF network. For simplicity, let us
consider the one-dimensional version of the problem that Zipser and Andersen
propose is solved by neurons in area 7. The position of a spot of light on
the retina is given as $r$; the eye position relative to the head is also given,
as $e$. The problem is to compute the position of the spot of light relative
to the head, i.e., $h = r + e$. Stated in these terms, the problem is computationally
trivial and its solution simply requires the addition of the two
inputs $r$ and $e$. The situation is, however, more complicated due to the
actual representation in which $r$ and $e$ are given. In the equation, $r$ and $e$
are represented as numbers. Zipser and Andersen assume, in accordance
with physiology, a different representation: they assume that the position
$r$ of a spot of light is coded by the presence or absence of activity of one
or more cells in a retinotopic array. From this point of view, the goal of
the computation carried out by the network is to change representation
from array representation to number representation.

The simplest solution to the problem of changing from an array representation
to a number representation is the following. Assume that only one
cell in the array $f(x)$ is excited at any given position, i.e., $f(x) = \delta(r - x)$.
Simplifying somewhat the situation assumed by Zipser and Andersen, but
not altering it in any significant way, let us assume that $e$ is represented
directly as a number or a firing rate. The problem then is to convert the
array representation $f(x) = \delta(r - x)$ for the retinal position into a number
(or a firing rate) representation. Consider a linear unit that summates linearly
all inputs with the “receptive field” $w(x)$. The output $I$ is given by
$I = \int w(x) f(x)\,dx$. For $f(x) = \delta(x - r)$, the choice $w(x) = x$ yields $I = r$.
Thus a simple solution to our problem of converting an array representation
into a number representation only needs receptive fields that increase
linearly with eccentricity (notice that $w(x) = ax$ may also be acceptable;
simply a monotonic dependence on $x$ may be a sufficient approximation).
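
A discrete version of this argument can be written in a few lines. The sketch assumes an array of 11 cells at integer positions, a single active cell, and an eye position given directly as a number; the array size, the example values and the function names are assumptions for illustration.

```python
def delta_array(active_index, size):
    """Array representation: only the cell at active_index is active."""
    return [1.0 if i == active_index else 0.0 for i in range(size)]

def linear_unit(activity, receptive_field):
    """Linear unit: sum of the inputs weighted by the receptive field w(x)."""
    return sum(w * f for w, f in zip(receptive_field, activity))

size = 11
positions = [i - size // 2 for i in range(size)]  # cell positions x = -5 ... 5
ramp = [float(x) for x in positions]              # receptive field w(x) = x

activity = delta_array(positions.index(3), size)  # spot of light at r = 3
r = linear_unit(activity, ramp)                   # number representation of r
e = -1.5                                          # eye position, given as a number
h = r + e                                         # head-centered position
```

The linear unit does nothing more than read out the position of the active cell; once both quantities are numbers, the head-centered position is a plain sum.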

If a Gaussian HyperBF network with a polynomial term of degree one is
used to approximate the relation of the equation from a set of input-output
examples, some of the basis functions will be linear units such as the ones
described above and some will be the product of 2D Gaussians representing
the visual receptive fields and 2D Gaussians representing the eye position.
These latter cells would probably account for the multiplicative property
of the area 7 cells found by Andersen. We conjecture that other features
of the cells could be replicated in a HyperBF simulation.

Figure captions

Fig. 1. (a) The basic learning module that - we conjecture - is used by the
brain for a number of tasks. The module learns to approximate a multivariate
function from a set of examples (i.e., a set of input-output pairs). (b) A HyperBF
network equivalent to a module for approximating a scalar function of three
variables from sparse and noisy data. The data, a set of points where the
value of the function is known, can be considered as examples to be used during
learning. The hidden units evaluate the function $G(x; t_\alpha)$, and a fixed, nonlinear,
invertible function may be present after the summation. The units are in general
fewer than the number of examples. The parameters that are determined during
learning are the coefficients $c_\alpha$, the centers $t_\alpha$ and the norm-weights $W$. In
the radial case $G = G(\|x - t_\alpha\|_W^2)$ and the hidden units simply compute the
radial basis functions $G$ at the “centers” $t_\alpha$. The radial basis functions may be
regarded as matching the input vectors against the “templates” or “prototypes”
that correspond to the centers (consider, for instance, the case of a radial Gaussian around
its center, which is a point in the n-dimensional space of inputs). There may
also be connections computing the polynomial term of 1: constant and linear terms
(the dotted lines in figure 1b) may be expected in most cases.

Fig. 2. A three-dimensional radial Gaussian implemented by multiplying two-dimensional
Gaussian and one-dimensional Gaussian receptive fields. The latter
two functions are synthesized directly by appropriately weighted connections
from the sensor arrays, as neural receptive fields are usually thought to arise.
Notice that they transduce the implicit position of stimuli in the sensor array
into a number (the activity of the unit). They thus serve the dual purpose of
providing the required “number” representation from the activity of the sensor
array and of computing a Gaussian function. 2D Gaussians acting on a
retinotopic  map  can  be  regarded  as  representing  2D  “features,”  while  the  radial 
basis  function  represents  the  “template”  resulting  from  the  conjunction  of  those 
lower-dimensional  features. 
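
The factorization that Fig. 2 relies on can be checked directly: a radial Gaussian of a three-dimensional input equals the product of a two-dimensional and a one-dimensional Gaussian over the corresponding coordinates, provided all share the same width. The sample point, center and width below are arbitrary assumptions, and the function name is hypothetical.

```python
import math

def radial_gaussian(point, center, sigma=1.0):
    """Gaussian of the Euclidean distance between point and center (any dimension)."""
    sq_dist = sum((p - c) ** 2 for p, c in zip(point, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x = (0.3, -1.2, 0.8)   # input, e.g. a 2D position plus one further parameter
t = (0.0, -1.0, 1.5)   # center ("template")

direct = radial_gaussian(x, t)                 # 3D radial Gaussian
feature_2d = radial_gaussian(x[:2], t[:2])     # 2D Gaussian receptive field
feature_1d = radial_gaussian(x[2:], t[2:])     # 1D Gaussian receptive field
factored = feature_2d * feature_1d             # "conjunction" of the two features
```

The two values agree because the squared distance is a sum over coordinates and the exponential turns sums into products; no norm needs to be computed explicitly by the unit that multiplies the two features.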

Fig. 3. (a) The HyperBF network proposed for the recognition of a 3D object
from any of its perspective views (Poggio and Edelman, 1990). The network
attempts to map any view (as defined in the text) into a standard view,
arbitrarily chosen. The norm of the difference between the output vector f and
the standard view s is thresholded to yield a 0, 1 answer. The 2N inputs
accommodate the input vector v representing an arbitrary view. Each of the K
radial basis functions is initially centered on one of a subset of the M views used
to synthesize the system (K < M). During training each of the M inputs in
the training set is associated with the desired output, i.e., the standard view
s. Fig. 3(b) shows a completely equivalent interpretation of (a) for the special
case of Gaussian radial basis functions. Gaussian functions can be synthesized
by multiplying the outputs of two-dimensional Gaussian receptive fields that
“look” at the retinotopic map of the object point features. The solid circles
in the image plane represent the 2D Gaussians associated with the first radial
basis function, which represents the first view of the object. The dotted circles
represent the 2D receptive fields that synthesize the Gaussian radial function
associated with another view. The 2D Gaussian receptive fields transduce
positions of features, represented implicitly as activity in a retinotopic array, and
their product “computes” the radial function without the need of calculating
norms and exponentials explicitly. From Poggio and Girosi (1990b).

Fig. 4. A hierarchical scheme in which HyperBF modules are inputs to another
HyperBF module. As an example, a scheme of this type may be used for 3D
object recognition in the general case of spurious and missing features. Instead
of encoding all n features one encodes only subsets of dimension d, where d < n.
The input to each module in the first row is a different set of features of the
object; the output is a value between 0 and 1 that indicates the degree of certainty
that the input is the sought object. The last module is a decision module that
integrates the various inputs. Notice that all modules could be synthesized by
learning through independent sets of examples.

Fig. 5. (a) A sketch of the neurons of the cerebellum and their connections.
In our conjecture, these would be the basic elements of a HyperBF network: the
mossy fibers are the inputs, the granule cells correspond to the various centers
and basis functions $G(x, x_i)$, the Purkinje cells correspond to the output units
that summate the weighted activities of the basis units, whereas the climbing
fibers carry the “teacher” signal $y_i$. The strength of the synapses between the
parallel fibers and the Purkinje cells would correspond to the $c_\alpha$. (b) The
corresponding HyperBF network is shown on the right: it has two basis functions
corresponding to the two granule cells on the left and two output summation
units corresponding to the two Purkinje cells on the left.

Fig. 6. Two problems in motor control: (a) determining the trajectory $x(t)$
from a small set of points $(t_i, x_i)$ on the desired trajectory and (b) computing
the field of muscle forces for each of the points on the trajectory. The figure
suggests that two different HyperBF modules may be used to perform the two
tasks. In (a) a HyperBF module approximates the trajectory from the sparse
points by superimposing Gaussian distributions with the appropriate weights
in such a way as to satisfy some minimum-jerk-like principle. In (b) a module of
the HyperBF type has been synthesized during development and continuously
adapted to generate the appropriate field of forces for each equilibrium position
x. It is similar to an approximating look-up table. A behaviour of the look-up
table type was suggested by Bizzi because of very recent experimental data (see
Bizzi et al., 1990).